Code

Imports

Read the whole dataset and reduce it to what we are interested in

Sort it by ['dobdb_family_id', 'earliest_publn_date']
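
A minimal sketch of this sorting step, assuming the data sits in a pandas dataframe and that `earliest_publn_date` has already been parsed as a datetime. The toy values are made up for illustration.

```python
import pandas as pd

# Toy data; column names follow the heading above, values are invented.
df = pd.DataFrame({
    "dobdb_family_id": [2, 1, 2, 1],
    "earliest_publn_date": pd.to_datetime(
        ["2015-03-01", "2012-06-15", "2011-01-20", "2010-09-30"]
    ),
})

# Sort by family first, then chronologically within each family.
df_sorted = df.sort_values(
    ["dobdb_family_id", "earliest_publn_date"]
).reset_index(drop=True)
```

Sorting this way means that "the last row of a family" is also its most recent publication, which the later per-family selection relies on.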

Reduce to the years we are interested in

In appln_abstract and appln_title, replace NaNs with ' '
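
One way this replacement might look in pandas. The two column names come from the heading; the example rows are invented.

```python
import numpy as np
import pandas as pd

# Toy data with one missing title and one missing abstract.
df = pd.DataFrame({
    "appln_title": ["Battery recycling", np.nan],
    "appln_abstract": [np.nan, "A method for recovering lithium."],
})

# Replace missing values in both text columns with a single space.
text_cols = ["appln_title", "appln_abstract"]
df[text_cols] = df[text_cols].fillna(" ")
```

After this step, string methods can be applied to both columns without special handling for missing values.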

Infer our time frame from the data
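
A small sketch of inferring the time frame, assuming it is defined by the earliest and latest year found in `earliest_publn_date`. The variable names `first_year`, `last_year`, and `year_range` are illustrative.

```python
import pandas as pd

# Toy data; dates are invented.
df = pd.DataFrame({
    "earliest_publn_date": pd.to_datetime(
        ["2012-06-15", "2010-09-30", "2015-03-01"]
    ),
})

# Take the first and last year present in the data.
years = df["earliest_publn_date"].dt.year
first_year, last_year = int(years.min()), int(years.max())

# Every year in the frame, including years with no rows.
year_range = list(range(first_year, last_year + 1))
```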

Of every family, keep only the last English, non-NaN title and abstract
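
The selection could be sketched as below for titles; abstracts would work the same way. The language column `appln_title_lg` and the helper `last_english_text` are assumptions for illustration, as is treating whitespace-only strings (left by the NaN replacement) as missing.

```python
import pandas as pd

# Toy data; "appln_title_lg" is an assumed language-code column.
df = pd.DataFrame({
    "dobdb_family_id": [1, 1, 1, 2, 2],
    "earliest_publn_date": pd.to_datetime(
        ["2010-01-01", "2011-01-01", "2012-01-01", "2013-01-01", "2014-01-01"]
    ),
    "appln_title": ["Old title", "New title", "Neuer Titel", "Only title", " "],
    "appln_title_lg": ["en", "en", "de", "en", "en"],
})

def last_english_text(frame, text_col, lang_col):
    """Return one row per family: its last English, non-empty text."""
    is_english = frame[lang_col] == "en"
    has_text = frame[text_col].notna() & (frame[text_col].str.strip() != "")
    kept = frame[is_english & has_text]
    kept = kept.sort_values(["dobdb_family_id", "earliest_publn_date"])
    return kept.groupby("dobdb_family_id", as_index=False).last()

titles = last_english_text(df, "appln_title", "appln_title_lg")
```

The result keeps the family ID and the publication date of the selected row, so the year can be read off with `.dt.year` when it is needed.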

For Bruno: Of every family, keep only the last English, non-NaN title and abstract, and also save the respective family ID and year

Get title and abstract counts for each year

Write counts to a dataframe and normalise them
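
A possible shape for this step. The normalisation used here (each year's count divided by the total over all years) is an assumption; other choices, such as dividing by the maximum, would follow the same pattern.

```python
import pandas as pd

# Toy data: one row per kept title, with its year.
titles = pd.DataFrame({"year": [2010, 2010, 2011, 2012, 2012, 2012]})

# Count titles per year and store the counts in a dataframe.
counts = titles.groupby("year").size().rename("n_titles").to_frame()

# Assumed normalisation: share of each year in the overall total.
counts["share"] = counts["n_titles"] / counts["n_titles"].sum()
```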

Define stopwords, contexts, equivalents, words to replace, and punctuation

Define a function for key phrase extraction and counting
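
A simplified sketch of n-gram extraction and counting. The function name `count_ngrams` and the stopword list are illustrative, and the contexts, equivalents, and replacement rules mentioned earlier are not modelled. Note that dropping stopwords before forming n-grams joins words that were not adjacent in the original text.

```python
from collections import Counter

STOPWORDS = {"a", "of", "for", "the", "and"}  # illustrative list only

def count_ngrams(texts, n, stopwords=STOPWORDS):
    """Count n-grams across all texts after dropping stopwords."""
    counts = Counter()
    for text in texts:
        tokens = [t for t in text.lower().split() if t not in stopwords]
        # Zip n shifted copies of the token list to form n-grams.
        ngrams = zip(*(tokens[i:] for i in range(n)))
        counts.update(" ".join(gram) for gram in ngrams)
    return counts

titles = ["Recycling of lithium ion batteries", "Lithium ion battery pack"]
bigrams = count_ngrams(titles, 2)
```

Calling the same function with `n=1` or `n=3` gives the unigram and trigram counts.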

Define a function for generating LaTeX code
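
A minimal sketch of such a generator, assuming the output is a two-column `tabular` of phrases and counts. The name `to_latex_table` and the layout are hypothetical; characters that are special in LaTeX (such as `&`, `%`, `_`) are not escaped here.

```python
def to_latex_table(rows, header):
    """Build a simple two-column LaTeX tabular from (phrase, count) rows."""
    lines = [
        r"\begin{tabular}{lr}",
        r"\hline",
        f"{header[0]} & {header[1]} \\\\",
        r"\hline",
    ]
    for phrase, count in rows:
        lines.append(f"{phrase} & {count} \\\\")
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines)

latex = to_latex_table(
    [("lithium ion", 2), ("battery pack", 1)], ("Phrase", "Count")
)
```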

Two more definitions

Results

Titles

Titles - unigrams

Titles - bigrams

Titles - trigrams

Abstracts

Abstracts - unigrams

Abstracts - bigrams

Abstracts - trigrams

Capitalise the first letter of each string
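
One way to capitalise only the first letter. The helper name `capitalise_first` is illustrative. Slicing is used instead of `str.capitalize()`, because the latter also lower-cases the rest of the string and would turn "EV" into "ev".

```python
def capitalise_first(text):
    """Upper-case the first character and leave the rest unchanged."""
    return text[:1].upper() + text[1:]

labels = [capitalise_first(s) for s in ["lithium ion", "li-ion EV battery", ""]]
```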

Search for certain strings

Conduct search and save results

Write counts to a .txt file

Compute union of title family IDs and abstract family IDs
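
A sketch of the union, assuming the search is a case-insensitive substring match and that family IDs are collected as Python sets. The helper `matching_ids` and the toy rows are invented.

```python
import pandas as pd

# Toy data; values are invented.
df = pd.DataFrame({
    "dobdb_family_id": [1, 2, 3, 4],
    "appln_title": ["Battery recycling", "Solar cell", "Electrode", "Recycling plant"],
    "appln_abstract": ["A cell.", "Recycling of panels.", "An electrode.", "A plant."],
})

def matching_ids(frame, column, needle):
    """Return the family IDs whose text in `column` contains `needle`."""
    mask = frame[column].str.lower().str.contains(needle, regex=False)
    return set(frame.loc[mask, "dobdb_family_id"])

title_ids = matching_ids(df, "appln_title", "recycling")
abstract_ids = matching_ids(df, "appln_abstract", "recycling")

# A family counts once, whether it matched in the title, the abstract, or both.
union_ids = title_ids | abstract_ids
```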

Create labels for new table

Alternative labels

Compute unions and create table

Compute yearly aggregate

Compute each word group's total and sort from largest to smallest
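
Both aggregates can be sketched from a table with one row per year and one column per word group. The group names and counts below are made up.

```python
import pandas as pd

# Toy table: rows are years, columns are word groups.
table = pd.DataFrame(
    {"recycling": [3, 5], "reuse": [1, 2], "repair": [4, 6]},
    index=pd.Index([2010, 2011], name="year"),
)

# Yearly aggregate: sum across word groups for each year.
yearly_total = table.sum(axis=1)

# Total per word group, sorted from largest to smallest.
group_total = table.sum(axis=0).sort_values(ascending=False)
```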

Visualise yearly aggregate on the left and each word group's total on the right

Compute year-over-year increases
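
A short sketch of the year-over-year computation on a yearly series, showing both the absolute and the percentage change. The counts are invented.

```python
import pandas as pd

# Toy yearly aggregate.
yearly = pd.Series([100, 120, 150], index=[2010, 2011, 2012], name="n_families")

# Absolute change from the previous year (NaN for the first year).
yoy_abs = yearly.diff()

# Percentage change from the previous year (NaN for the first year).
yoy_pct = yearly.pct_change() * 100
```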

Compute share of circular IPFs out of all our IPFs

For each year

For the whole time period
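
The two shares could be computed as follows, assuming yearly counts of circular IPFs and of all IPFs are available side by side. The column names and numbers are illustrative.

```python
import pandas as pd

# Toy yearly counts of circular IPFs and of all IPFs.
counts = pd.DataFrame(
    {"circular": [5, 10, 15], "all": [100, 200, 200]},
    index=pd.Index([2010, 2011, 2012], name="year"),
)

# Share for each year.
share_by_year = counts["circular"] / counts["all"]

# Share for the whole period: ratio of the sums, not the mean of yearly shares.
share_overall = counts["circular"].sum() / counts["all"].sum()
```

The overall share weights each year by its number of IPFs, so it generally differs from the unweighted mean of the yearly shares.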

Compare the IEA & EPO Li-ion and other lithium series to ours

Two analytical tasks:

[1]

[2]

Reduce dataset to rows that contain strings we are potentially interested in

Convert to lower case, remove punctuation and numbers, collapse multiple spaces, and add beginning-of-string and end-of-string markers
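
A sketch of this cleaning pipeline. The marker tokens `<s>` and `</s>` and the function name `normalise` are hypothetical, and replacing punctuation with a space (rather than deleting it) is an assumption. The markers are added last so that punctuation removal does not strip them.

```python
import re
import string

BOS, EOS = "<s>", "</s>"  # hypothetical begin/end-of-string markers

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))

def normalise(text):
    """Lower-case, strip punctuation and digits, squeeze spaces, add markers."""
    text = text.lower()
    text = text.translate(PUNCT_TO_SPACE)
    text = re.sub(r"\d+", " ", text)
    text = re.sub(r"\s+", " ", text).strip()
    return f"{BOS} {text} {EOS}"

cleaned = normalise("Li-ion  Battery, 18650 cells!")
```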

For each string group, get a dataframe containing only the rows we are interested in

Replace each string with its string group's proxy

For each string, get trigrams
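
A sketch of collecting trigrams around a proxy token, assuming the texts have already been cleaned and the search strings replaced by a proxy. The proxy `GROUP_A`, the markers, and the function name `trigrams_containing` are invented for illustration.

```python
from collections import Counter

def trigrams_containing(texts, token):
    """Count every word trigram that contains `token`."""
    counts = Counter()
    for text in texts:
        words = text.split()
        for i in range(len(words) - 2):
            gram = words[i:i + 3]
            if token in gram:
                counts[" ".join(gram)] += 1
    return counts

texts = ["<s> recycling of GROUP_A cells </s>", "<s> GROUP_A cells </s>"]
result = trigrams_containing(texts, "GROUP_A")
```

With the string markers in place, a proxy at the very start or end of a text still appears in a full trigram.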